Objective and Background Information

The purpose of this project is to examine what characteristics/variables make a given song “popular”, and to create an algorithm that can predict whether a song will be popular or not. This algorithm could be used in many ways, such as determining which songs an artist should release if the primary goal is to release a popular song—this could be very useful for record labels trying to release hit songs.

Dataset

The dataset comprises around 170,000 Spotify songs from 1920-2020 and was last updated 11 days ago (11/25/2020). This article does a great job of explaining the various “audio features” that Spotify attaches to each song. Here are the first couple of songs in the dataset as a reference:

Our target variable is “popularity”, but because it is a score from 0-100 (as seen below), we’ll convert it to a binary variable: a popularity >= 50 marks a “popular” song, and < 50 a “not popular” song.

Of course, there is no set cutoff for what makes a song “popular”; however, about 20% of the songs have a popularity >= 50, so that appears to be a fair cutoff. Note that the popularity variable is based on the number of recent plays of a given song, so more recent songs tend to score higher. Our analysis will therefore reveal what qualities make a song popular right now, which is of far more use to a record label or artist than historical popularity, since trends in music change every year.

Questions

  • What factors contribute to making a song popular?

  • Do songs with explicit content share other similar characteristics? (This could help families and parents.)

  • Do certain factors make a song higher or lower in valence (>= 0.6 happy, 0.4 < neutral < 0.6, <= 0.4 sad/angry)?

  • Does popularity come down largely to energy and danceability? With current waves like TikTok, it may not take much for a song to become popular: it just needs to be discovered and have the right aspects.

Purpose for Exploration

  • Examining popularity is a good way to gauge current cultural trends in the US

  • Helpful when composing music to see what factors play heavily into popularity

  • A reference for comparison can be found here; it conducts a similar analysis of popularity on Spotify songs

Methods being Used

We decided to use random forests and xgboost because of their interpretability, minimal data-preparation requirements, and ability to handle both numerical and categorical data.

What Factors Contribute to the Popularity of a Song at the End of 2020?

The first analysis we’ll conduct is building a random forest, and eventually a boosted tree, to predict whether a song will be popular and to identify which factors most impact popularity. We’ll also filter the data to include only songs from the 1970s or later, as those are the songs relevant to the analysis.

Cleaning the Data

First some data cleaning is necessary before building the models. The primary tasks are factoring/refactoring certain variables and creating thresholds to make binary or factorable variables.
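As a rough sketch of these steps (object and column names such as `spotify`, `year`, and `popularity` are assumptions about the raw data, and the decade labels here differ slightly from the report’s “2020” group):

```r
library(dplyr)

# Keep songs from 1970 onward, binarize popularity at 50, and bucket
# years into decade factors.
spotify_clean <- spotify %>%
  filter(year >= 1970) %>%
  mutate(
    is_popular = factor(ifelse(popularity >= 50, "popular", "not popular")),
    decade     = factor(paste0(floor(year / 10) * 10, "s"))
  )
```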

Exploratory Analysis

Summary Statistics

Below are some relevant statistics regarding the popularity target variable:

Therefore, the base rate of the data is 36.90%. We can also view the breakdown of how many songs fall into each decade: the majority predate 2020, which makes sense since each decade spans 10 years while 2020 is a single year. Looking at the percent of songs that were popular in each decade, there is, as one might expect, a steady increase in popularity over time, with about 85% of the songs from 2020 being popular.

Data Visualizations

Hypothesis on Important Variables

Based on the above visualizations, we can initially hypothesize that songs that are easier to dance to, have more energy, and are in a major scale will be more popular. It is interesting that the spread of valence for popular vs. non-popular songs is similar, as we would’ve expected happier songs (higher valence) to be more popular. Based on the analysis above, we would also expect decade to be very important because more recent songs are more popular; however, decade is excluded from our analysis because that correlation is rather obvious and won’t give us much insight. Now we’ll build our models.

Random Forest

Testing and Training Data

The dataset will be split 90/10 training and testing.
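A minimal sketch of that split (object names are illustrative):

```r
# 90/10 train/test split on the cleaned data.
set.seed(2020)
train_idx     <- sample(nrow(spotify_clean), size = floor(0.9 * nrow(spotify_clean)))
popular_train <- spotify_clean[train_idx, ]
popular_test  <- spotify_clean[-train_idx, ]
```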

Mtry level

The Mtry level is the number of variables randomly sampled as candidates at each split. The default number for classification is sqrt(# of variables).

Random Forest - 500 Trees

Initially, we will be generating a random forest made up of 500 trees, and an mtry of 4. In order to ensure that these trees are not all identical and have the opportunity to specialize in different subsets of the data, we will set the argument of replace to TRUE.

Model output:

## 
## Call:
##  randomForest(formula = is_popular ~ ., data = popular_train,      ntree = 500, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 29.36%
## Confusion matrix:
##             not popular popular class.error
## not popular       51186    6267   0.1090805
## popular           20477   13151   0.6089271

Initially we have an out-of-bag error of 29.36%, which is mediocre but not bad for a first try; with some tuning we can bring this down. Based on the initial confusion matrix, the model is far better at correctly classifying unpopular songs than popular ones, which makes sense given that the dataset is rich in unpopular instances.

Evaluating Model

Accuracy

The accuracy of our initial model is 70.64%, which is fair, but accuracy alone is not a great metric because it can mask the model’s ability to predict true positives versus true negatives. Let’s dig deeper.

Train vs Predict

If we take a look at the training-set frequencies for popular vs. unpopular, about 37% of the set are popular songs (matching the base rate). However, when looking at the random forest’s predictions, about 27% of the predictions were popular songs. Therefore, we can assess that the model errs on the side of classifying a song as unpopular.

Variable Importance

We can see that the variables of loudness, danceability, valence, and key are the most important.

Error Visualization

In the above table we can view the out of bag error rate for each individual tree, as well as the difference between the popular and unpopular error rates.

We can also create a plot that expresses each error component visually:

The error terms gradually flatten out after about 200 trees, so we can infer that a forest of 500 trees is rather excessive.

Confusion Matrix

##             not popular popular class.error
## not popular       51186    6267   0.1090805
## popular           20477   13151   0.6089271

As mentioned above, the model is far superior at classifying unpopular songs than popular ones, and errs on the side of predicting “not popular”: from the confusion matrix, the false positive rate is about 11% (6,267 / 57,453), while the false negative rate is about 61% (20,477 / 33,628). The resulting sensitivity of roughly 39% again tells us the classifier is poor at identifying popular songs.

Error Table

Comparing Random Forests

The 131-tree model makes slightly more “popular” predictions than the 500-tree model, but there is little difference between the two.

Below both variable importance plots are displayed, the first for the 500 tree model, the second for the 131 tree model.

There is little difference between the variable importance plots, with loudness, key, danceability, and valence appearing to be the most important variables.

131 Forest Error Visualization

Confusion Matrices

A confusion matrix for the 500 trees is displayed first, then one for 131 trees.

##             not popular popular class.error
## not popular       51186    6267   0.1090805
## popular           20477   13151   0.6089271
##             not popular popular class.error
## not popular       50732    6721   0.1169826
## popular           20190   13438   0.6003925

The confusion matrices for both forests are very similar, telling us that limiting the number of trees in our forest doesn’t change much; let’s instead try to tune the model to see if that improves our sensitivity.

Predictions on Test Data

131 Trees

First we use the predict function in R and attach the predictions to our test set so that we can use the confusionMatrix function. The output is below:
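A sketch of that call, assuming caret’s `confusionMatrix` and a forest object named `rf_131` (names illustrative):

```r
library(caret)

# Predict on the held-out songs and tabulate against the true labels.
popular_test$pred <- predict(rf_131, newdata = popular_test, type = "response")
confusionMatrix(
  data      = popular_test$pred,
  reference = popular_test$is_popular,
  positive  = "popular",
  mode      = "everything"  # also reports precision, recall, and F1
)
```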

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    not popular popular
##   not popular        5590    2231
##   popular             812    1487
##                                           
##                Accuracy : 0.6993          
##                  95% CI : (0.6903, 0.7082)
##     No Information Rate : 0.6326          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2969          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3999          
##             Specificity : 0.8732          
##          Pos Pred Value : 0.6468          
##          Neg Pred Value : 0.7147          
##               Precision : 0.6468          
##                  Recall : 0.3999          
##                      F1 : 0.4943          
##              Prevalence : 0.3674          
##          Detection Rate : 0.1469          
##    Detection Prevalence : 0.2272          
##       Balanced Accuracy : 0.6366          
##                                           
##        'Positive' Class : popular         
## 

As mentioned above, the sensitivity is very poor (40%) and our F1 score (a measure of how good our model is at predicting the positive class) is also low (0.49). Let’s use the tuneRF function to try and improve the model.

Tuning Model

Using the tuneRF function, we now search for the optimal number of variables to try at each split.
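A sketch of the search (argument values are illustrative; starting at mtry = 5 with a step factor of 2 yields the candidates 3, 5, and 10 seen in the output below):

```r
library(randomForest)

# Try mtry values around the start value and report the OOB error at each.
tuned <- tuneRF(
  x = popular_train[, setdiff(names(popular_train), "is_popular")],
  y = popular_train$is_popular,
  mtryStart = 5, stepFactor = 2, ntreeTry = 131, improve = 0.01
)
```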

##        mtry  OOBError
## 3.OOB     3 0.2818810
## 5.OOB     5 0.2806184
## 10.OOB   10 0.2832424

After running the function the mtry with the lowest out of bag error rate is 5 so we’ll run a model with that parameter.

Random Forest - 131 trees, mtry = 5

Model output:

## 
## Call:
##  randomForest(formula = is_popular ~ ., data = popular_train,      ntree = 131, mtry = 5, replace = TRUE, sampsize = 100, nodesize = 5,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 131
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 29.8%
## Confusion matrix:
##             not popular popular class.error
## not popular       51012    6441   0.1121090
## popular           20699   12929   0.6155287

Even with the parameter change, the model still doesn’t perform much better (out-of-bag error rate of 29.8%).

ROC for 131 trees, 4 mtry

As a final metric, we can assess the ROC curve and calculate an AUC (area under curve) which is an expression of the balance of the sensitivity and specificity.
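The curve can be produced with the pROC package (a sketch; `rf_131` and other names are illustrative):

```r
library(pROC)

# ROC curve and AUC from the forest's predicted class probabilities.
probs   <- predict(rf_131, newdata = popular_test, type = "prob")[, "popular"]
roc_131 <- roc(response = popular_test$is_popular, predictor = probs, plot = TRUE)
auc(roc_131)
```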

Our AUC comes out to 0.72 which is fair while not great. It is low due to a low sensitivity, as discovered previously.

XGBoosted Tree

As a last resort, let’s see if boosting our tree will aid in the model’s ability to predict the positive class. We’ll use the xgboost (extreme gradient boosting) package, which fits each new tree to the residual errors of the current ensemble, gradually improving the model. For the model parameters, we decided on a max depth of 6 (to limit overfitting), an eta of 0.1 (a fairly conservative learning rate), and 400 boosting rounds (“passes” through the data).

Some data preparation is necessary, but it mostly involves one-hot encoding the dataset so it can be passed to the xgboost functions.
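A sketch of that preparation and the training call (object names are illustrative; the parameters mirror the call shown in the model output that follows):

```r
library(xgboost)

# model.matrix() one-hot encodes the factor columns; xgb.DMatrix wraps the
# numeric matrix and labels for xgboost.
X_train <- model.matrix(is_popular ~ . - 1, data = popular_train)
y_train <- as.numeric(popular_train$is_popular == "popular")
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

boosted <- xgb.train(
  params    = list(max_depth = 6, eta = 0.1, objective = "binary:logistic"),
  data      = dtrain,
  nrounds   = 400,
  watchlist = list(train = dtrain),
  print_every_n = 50
)
```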

Model Output

Below is the raw output of the boosted model where one can view the parameters and features of the tree.

## ##### xgb.Booster
## raw: 1.2 Mb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, max.depth = 6, eta = 0.1, objective = "binary:logistic")
## params (as set within xgb.train):
##   max_depth = "6", eta = "0.1", objective = "binary:logistic", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
##   cb.evaluation.log()
## # of features: 26 
## niter: 400
## nfeatures : 26 
## evaluation_log:
##     iter train_error
##        1    0.293585
##        2    0.291268
## ---                 
##      399    0.215270
##      400    0.215105

Error Plot

Below is a plot showing the decrease in training error over the 400 iterations. The error drops quickly, illustrating how fast the model learns, an advantage of the xgboost package.

Variable Importance

We can also pull the variable importance to see if the boosted model selects similar important variables as the random forests.

The xgboost model has similar important variables (loudness, valence, danceability); however, it is interesting that the duration of the song is ranked so high.

Confusion Matrix

We can also pull a confusion matrix from the model to compare it with the past random forest models.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 5427 1820
##          1  975 1898
##                                          
##                Accuracy : 0.7238         
##                  95% CI : (0.715, 0.7325)
##     No Information Rate : 0.6326         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3761         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.5105         
##             Specificity : 0.8477         
##          Pos Pred Value : 0.6606         
##          Neg Pred Value : 0.7489         
##               Precision : 0.6606         
##                  Recall : 0.5105         
##                      F1 : 0.5759         
##              Prevalence : 0.3674         
##          Detection Rate : 0.1875         
##    Detection Prevalence : 0.2839         
##       Balanced Accuracy : 0.6791         
##                                          
##        'Positive' Class : 1              
## 

The xgboosted tree performs much better than the random forests, as expected given xgboost’s more complex nature. Below is a comparison of the xgboost model’s metrics to the 131-tree model (random forest values in parentheses):

  • Accuracy = 72.38% (69.93%)

  • Kappa = 0.38 (0.30)

  • Sensitivity = 51.05% (39.99%)

  • Specificity = 84.77% (87.32%)

  • F1 = 0.58 (0.49)

  • Balanced Accuracy = 67.91% (63.66%)

The primary metric we cared about improving was the sensitivity (how well we predict popular songs), and the xgboost model improved by about 11 percentage points to a value of 51%. While still not excellent, it is much better than the random forest models. Similarly, the F1 score is 0.58 as opposed to the 0.49 of the 131-tree random forest, indicating the boosted model is better at classifying the positive outcome. However, the specificity for the boosted model is about 3.5 points lower than the random forest’s, indicating a higher false positive rate for the boosted tree; that is expected because the model classifies more cases as positive. Also, when using xgboost we have to be cautious about overfitting, since boosted models are flexible enough to fit noise in the training data.

Conclusions

Overall, after building the various random forest and xgboost models, the xgboost model performed the best. While the sensitivity, the rate at which the model correctly classifies the positive class (a popular song), is fairly low, the model still performs decently. Also, because the target is rather complex and subjective (there are no defined variables or method to determine whether a song is “popular”), we didn’t expect the models to have excellent prediction metrics. I would recommend that a record label or artist use the xgboost model only to get a sense of how popular a song might be, and would not put too much trust in it.

Do Songs with Explicit Content Generally Sound the Same?

For the second part of our project, we hope to explore just the songs labeled as having explicit content. We will consider several metrics given in the dataset to judge whether or not these songs “sound” similar. These factors will include energy, valence, danceability, popularity, and “speechiness”.

Exploratory Analysis

We will first look at the songs classified as explicit (using spotify2$explicit to pull the data), and collect some basic summary statistics about the set.

Summary Statistics

We see that the base rate of the data is 11.77%. From this table, we can see that the vast majority of explicit songs have come from the current decade (49% in the 2020s), and decreased progressively as we go further into the past. Now, let’s look at some data visualizations comparing the explicit songs to the factors listed in the aforementioned section.

Data Visualizations

The first box plot compares non-explicit and explicit songs by the danceability rating provided by Spotify. Explicit songs seem to have a slight edge in this metric, although there are an overall smaller number of songs in this category. Likewise, explicit songs seem to have a slightly higher mean in the “energy” metric as well, while non-explicit songs have higher valence. Explicit songs are almost 10 percentage points more popular on average than non-explicit music (around 55 percent to 45 for clean), and have a much higher speechiness value.

In terms of diversity of key selection, the distributions seem to differ somewhat between the two categories, despite the vast difference in sample size. Specifically, explicit songs tend to favor the C# key, while clean music leans more on the C, D, G, and A keys.

The last bar graph illustrates the rise in the proportion of explicit songs in the entire dataset as time progresses. This graph is a pictorial representation of the table we generated above.

Selecting Important Variables

Based on these graphs, I would hypothesize that key selection is an important factor in distinguishing explicit and non-explicit songs. The C# key is far more prevalent in proportion to other keys in explicit music, while C, D, G, and A appear more in clean music. Speechiness should also be important, given the large difference in the box plots.

Decision Tree
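A minimal sketch of fitting the tree with rpart (the data-frame name `explicit_dataset` and formula are assumptions based on the output that follows):

```r
library(rpart)

# Classification tree predicting explicit content from the audio features.
explicit_tree <- rpart(explicit ~ ., data = explicit_dataset, method = "class")
explicit_tree$variable.importance
```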

Model output:

## n= 101201 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 101201 11899 no explicit (0.88242211 0.11757789)  
##    2) speechiness< 0.1435 88366  5385 no explicit (0.93906027 0.06093973) *
##    3) speechiness>=0.1435 12835  6321 explicit (0.49248150 0.50751850)  
##      6) decade=1970s,1980s 2269   196 no explicit (0.91361833 0.08638167) *
##      7) decade=1990s,2000s,2010s,2020 10566  4248 explicit (0.40204429 0.59795571)  
##       14) speechiness< 0.2345 4421  2078 no explicit (0.52997059 0.47002941)  
##         28) danceability< 0.6665 1912   655 no explicit (0.65742678 0.34257322) *
##         29) danceability>=0.6665 2509  1086 explicit (0.43284177 0.56715823) *
##       15) speechiness>=0.2345 6145  1905 explicit (0.31000814 0.68999186) *

Variable Importance:

##      speechiness           decade       popularity     danceability 
##      4735.762749       977.678855       348.586247       130.664930 
##           energy         loudness     acousticness         liveness 
##        57.843841        43.005347        41.511290        33.609057 
##            tempo          valence instrumentalness      duration_ms 
##        23.014297        23.014297        16.489445         6.563738

From these variable importance metrics, we can see that speechiness, decade, popularity, and danceability are the four most important variables for predicting a song with explicit content. Our initial hypothesis was partially correct. I would guess that key does not appear because, given the size of the dataset, there is still a massive class imbalance between non-explicit and explicit songs regardless of the specific key.

Plotted Tree

The decision tree first uses speechiness to differentiate between the two categories. If the speechiness is less than 0.14, the song is automatically classified as non-explicit; otherwise, the tree moves on to the second step, which uses decade. If the decade of the song is the 1970s or 1980s, the song is immediately non-explicit; otherwise the model proceeds to the third step, which again uses speechiness. If the speechiness is above 0.23, the song is classified as explicit; otherwise the model proceeds to its fourth and final step, which is determined by danceability. If the danceability metric is below 0.67, the song is non-explicit; otherwise it is classified as explicit.

Plotted CP

Based on the above plot, it appears that four splits is the ideal amount for this model. Our tree above does indeed have four splits.

CP Table

        CP  nsplit  rel error     xerror       xstd        opt
 0.0869821       0  1.0000000  1.0000000  0.0086116  1.0086116
 0.0252962       2  0.8260358  0.8260358  0.0079170  0.8339528
 0.0100000       4  0.7754433  0.7796453  0.0077146  0.7831579

The CP table confirms the above result: four splits reduce the relative error by a greater margin than zero or two splits do.

Evaluating Model

Predicting Values

Now, we will try to determine the optimal model for predicting explicit songs. We will accomplish this by producing fitted classifications using type = “class”, then comparing the actual labels to the predicted ones to determine accuracy.

Actual Split

Predicted Split

Confusion Matrix

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    no explicit explicit
##   no explicit       86311     6236
##   explicit           2991     5663
##                                          
##                Accuracy : 0.9088         
##                  95% CI : (0.907, 0.9106)
##     No Information Rate : 0.8824         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5017         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.47592        
##             Specificity : 0.96651        
##          Pos Pred Value : 0.65438        
##          Neg Pred Value : 0.93262        
##              Prevalence : 0.11758        
##          Detection Rate : 0.05596        
##    Detection Prevalence : 0.08551        
##       Balanced Accuracy : 0.72121        
##                                          
##        'Positive' Class : explicit       
## 

The confusion matrix above tells us that the overall accuracy of our model is 90.88%, which is pretty good, especially for a dataset of this magnitude. The F1 score, the harmonic mean of precision (positive predictive value) and recall (sensitivity), is 0.55, which is also a reasonable result. The detection rate, the rate at which the algorithm detects the positive class in proportion to all classifications [A/(A+B+C+D), where A is the number of true positives], is 0.056.

ROC

## 
## Call:
## roc.default(response = explicit_dataset$explicit, predictor = as.numeric(explicit_fitted_model),     plot = TRUE)
## 
## Data: as.numeric(explicit_fitted_model) in 89302 controls (explicit_dataset$explicit no explicit) < 11899 cases (explicit_dataset$explicit explicit).
## Area under the curve: 0.7212

The ROC curve is another way to assess the model. As we can see from the above calculation, the area under the curve (AUC) for our predictions is 0.7212, which is decent but could definitely be improved. Let’s change the thresholds and run a random forest to see if we can create a better model.

Changing thresholds
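The threshold comparison can be sketched by re-cutting the tree’s fitted probabilities at each candidate value (`prob_explicit` and `explicit_tree` are illustrative names):

```r
# Re-classify the fitted probabilities at several thresholds and compare
# accuracy; the full table below also reports TPR, FPR, kappa, and F1.
prob_explicit <- predict(explicit_tree, type = "prob")[, "explicit"]

for (t in c(0.2, 0.4, 0.6)) {
  pred <- factor(ifelse(prob_explicit >= t, "explicit", "no explicit"),
                 levels = c("no explicit", "explicit"))
  cat(sprintf("threshold %.1f  accuracy %.4f\n",
              t, mean(pred == explicit_dataset$explicit)))
}
```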

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    no explicit explicit
##   no explicit       89302    11899
##   explicit              0        0
##                                           
##                Accuracy : 0.8824          
##                  95% CI : (0.8804, 0.8844)
##     No Information Rate : 0.8824          
##     P-Value [Acc > NIR] : 0.5024          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8824          
##               Precision :     NA          
##                  Recall : 0.0000          
##                      F1 :     NA          
##              Prevalence : 0.1176          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : explicit        
## 
 threshold  accuracy      tpr      fpr   kappa       f1
       0.2    0.9029  0.53097  0.04757  0.5081  0.56247
       0.4    0.9088  0.47592  0.03349  0.5017  0.55106
       0.6    0.9055  0.35633  0.02133  0.4238  0.46996

It appears that 0.2 is the ideal threshold for this model, as it has the highest TPR and F1 values. Now we can set up our random forest.

Random Forest

Testing and Training Data

The dataset will be split 90/10 training and testing.

Mtry level

The Mtry level is the number of variables randomly sampled as candidates at each split. The default number for classification is sqrt(# of variables).

The mtry comes out to 3.6, which we’ll round to 4.

Random Forest - 500 Trees

Initially, we will be generating a random forest made up of 500 trees, and an mtry of 4. In order to ensure that these trees are not all identical and have the opportunity to specialize in different subsets of the data, we will set the argument of replace to TRUE.

Model Output:

## 
## Call:
##  randomForest(formula = explicit ~ ., data = explicit_train, ntree = 500,      mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.41%
## Confusion matrix:
##             no explicit explicit class.error
## no explicit       79733      647 0.008049266
## explicit           7927     2774 0.740771890

Evaluating Model

Accuracy

The overall accuracy of our 500 tree model is about 90.5%, which is pretty good but not much better than our initial model.

Actual and Predicted

Important Variables

                  no explicit   explicit  MeanDecreaseAccuracy  MeanDecreaseGini
valence             0.0024474  0.0014454             0.0023297         0.8000776
acousticness        0.0014462  0.0063287             0.0020197         0.8326905
danceability        0.0047170  0.0294600             0.0076238         1.8215381
duration_ms         0.0004881  0.0009567             0.0005431         0.7652550
energy              0.0025200  0.0023346             0.0024981         0.8460576
instrumentalness    0.0004481  0.0179402             0.0025030         0.7458689
key                 0.0002385  0.0066839             0.0009957         2.0894633
liveness            0.0004334  0.0004604             0.0004366         0.7669011
loudness            0.0025724  0.0130235             0.0038002         0.9867742
mode               -0.0000509  0.0009839             0.0000707         0.1285905
popularity          0.0008627  0.0253006             0.0037336         1.5071601
speechiness         0.0136789  0.1365978             0.0281193         4.7354641
tempo               0.0011692  0.0012078             0.0011738         0.7863611
decade              0.0011948  0.0276167             0.0042987         1.0702378

Data Visualization

Confusion Matrix

##             no explicit explicit class.error
## no explicit       79733      647 0.008049266
## explicit           7927     2774 0.740771890

Error Tables

Looking at this error table, a few interesting values pop out immediately, most notably 248 trees, which has among the lowest out-of-bag (OOB) and popular-class error rates. Thus we will pick this value to run our optimal tree.

Comparing Random Forests

Both variable importance plots are displayed, the first for the 500 tree model, the second for the 248 tree model.

In both the meanDecreaseAccuracy and meanDecreaseGini categories, speechiness is far and away the most important variable in identifying explicit songs, followed by danceability, popularity, and decade. Key is also important for meanDecreaseGini, which further supports the initial hypothesis made at the beginning.

248 Forest Error Visualization

Confusion Matrices

A confusion matrix for the 500 trees is displayed first, then one for 248 trees.

##             no explicit explicit class.error
## no explicit       79733      647 0.008049266
## explicit           7927     2774 0.740771890
##             no explicit explicit class.error
## no explicit       79611      769 0.009567056
## explicit           7615     3086 0.711615737

Predictions on Test Data

248 Trees

First we use the predict function in order to create a confusion matrix.

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    no explicit explicit
##   no explicit        8832      847
##   explicit             90      351
##                                          
##                Accuracy : 0.9074         
##                  95% CI : (0.9016, 0.913)
##     No Information Rate : 0.8816         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3894         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.29299        
##             Specificity : 0.98991        
##          Pos Pred Value : 0.79592        
##          Neg Pred Value : 0.91249        
##               Precision : 0.79592        
##                  Recall : 0.29299        
##                      F1 : 0.42831        
##              Prevalence : 0.11838        
##          Detection Rate : 0.03468        
##    Detection Prevalence : 0.04358        
##       Balanced Accuracy : 0.64145        
##                                          
##        'Positive' Class : explicit       
## 

Tuning Model

Using the tuneRf function, we are now checking for the optimal number of variables to use/test during the tree building process.

##        mtry   OOBError
## 3.OOB     3 0.07141994
## 5.OOB     5 0.07163953
## 10.OOB   10 0.07140897

The tuneRF results show that 10 is probably the best mtry value to use, though not by a very large margin.

Conclusions

Through this process, we have identified several important quantifiers for classifying explicit songs without doing any lyric analysis. We can conclude that speechiness is the most important variable in identifying explicit music, as the importance metrics of the random forests show. In addition, danceability, key, and popularity are also important variables to consider.

What Makes a Song Possess High or Low Valence?

The final part of the dataset that we will consider consists of the factors that qualify a song as possessing high or low valence.

Exploratory Analysis

Correlation between Quantitative Variables

valence acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo year
valence 1.0000 -0.2102 0.5198 -0.1503 0.3317 -0.2318 -0.0252 0.2493 0.0272 0.1061 -0.1597
acousticness -0.2102 1.0000 -0.1930 -0.0508 -0.7079 0.1999 -0.0677 -0.5568 -0.1070 -0.1585 -0.1467
danceability 0.5198 -0.1930 1.0000 -0.1113 0.1297 -0.2724 -0.1306 0.2538 0.1953 -0.1098 0.1677
duration_ms -0.1503 -0.0508 -0.1113 1.0000 0.0000 0.1153 0.0540 -0.0597 -0.0381 -0.0353 -0.1084
energy 0.3317 -0.7079 0.1297 0.0000 1.0000 -0.1996 0.1756 0.7518 0.1507 0.2051 0.1447
instrumentalness -0.2318 0.1999 -0.2724 0.1153 -0.1996 1.0000 -0.0275 -0.3964 -0.1049 -0.0673 -0.0681
liveness -0.0252 -0.0677 -0.1306 0.0540 0.1756 -0.0275 1.0000 0.0743 0.1396 0.0160 -0.0511
loudness 0.2493 -0.5568 0.2538 -0.0597 0.7518 -0.3964 0.0743 1.0000 0.1190 0.1669 0.3386
speechiness 0.0272 -0.1070 0.1953 -0.0381 0.1507 -0.1049 0.1396 0.1190 1.0000 0.0346 0.1844
tempo 0.1061 -0.1585 -0.1098 -0.0353 0.2051 -0.0673 0.0160 0.1669 0.0346 1.0000 0.0135
year -0.1597 -0.1467 0.1677 -0.1084 0.1447 -0.0681 -0.0511 0.3386 0.1844 0.0135 1.0000

From this table we will remove acousticness because of its strong negative correlation with energy (-0.71); the two variables carry largely redundant information.
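A table like the one above can be produced with cor(); a minimal sketch on synthetic data, since the loaded dataset (here called spotify, a hypothetical name) is not shown:

```r
set.seed(1)
# Synthetic stand-in for the real dataset; in the report, spotify would hold
# all eleven quantitative columns listed in the table above.
spotify <- data.frame(valence      = runif(100),
                      energy       = runif(100),
                      acousticness = runif(100))

# Pairwise Pearson correlations, rounded to four decimals as in the table
cmat <- round(cor(spotify[, c("valence", "energy", "acousticness")]), 4)
cmat
```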

Summary Statistics

The dataset will be divided into three categories: a valence value above 0.6 will be considered “happy/cheerful”, 0.4 to 0.6 “neutral”, and less than 0.4 “sad/depressed.” The counts for each decade are displayed below:

decade valence_fact count
1970s happy/cheerful 10128
1970s neutral 6616
1970s sad/depressed 3256
1980s happy/cheerful 9654
1980s neutral 6177
1980s sad/depressed 4019
1990s happy/cheerful 9001
1990s neutral 6540
1990s sad/depressed 4360
2000s happy/cheerful 8301
2000s neutral 6959
2000s sad/depressed 4386
2010s happy/cheerful 5737
2010s neutral 8140
2010s sad/depressed 5897
2020 happy/cheerful 690
2020 neutral 920
2020 sad/depressed 420

The various base rates are as follows:

  • Happy = 42.99%

  • Neutral = 34.93%

  • Sad = 22.07%
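The three-way split above can be sketched with cut(); synthetic data stands in for the full dataset, and the spotify name and exact boundary handling are assumptions:

```r
set.seed(1)
# Synthetic stand-in for the full dataset
spotify <- data.frame(valence = runif(50),
                      year    = sample(1970:2020, 50, replace = TRUE))

# Bin valence into the three mood categories used above
spotify$valence_fact <- cut(spotify$valence,
                            breaks = c(-Inf, 0.4, 0.6, Inf),
                            labels = c("sad/depressed", "neutral",
                                       "happy/cheerful"))

# Decade label, e.g. 1987 -> "1980s"
spotify$decade <- paste0(10 * (spotify$year %/% 10), "s")
table(spotify$decade, spotify$valence_fact)
```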

Data Visualizations

As we would initially expect, happy songs tend to have the highest energy level, followed by neutral and then sad. The tempo of all three categories tends to hover between 100 and 125, and the popularity between 37 and 50, with no statistically significant difference. In terms of key, the distributions look essentially identical across the three categories. Finally, the bar graph shows a rise in “neutral” and “sad” songs in recent decades, whereas the number of happy songs has gradually decreased over time.

Selecting Important Variables

Instead of a random forest, we will use a k-nearest neighbors (kNN) model, which classifies the valence category of a song by the majority label among its k nearest neighbors in feature space.

kNN

Training/Testing Data

For our algorithm, we have decided to do a 90/10 split of the data for training and testing.

3NN

The first section of our model will be a 3-nearest neighbors model, using the above 90/10 train/test split.
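A sketch of the split and the 3-NN fit using the class package; synthetic features stand in for the scaled audio features, and all object names are illustrative:

```r
library(class)

set.seed(1)
# Synthetic stand-in for the scaled audio features and valence labels
n <- 200
features <- data.frame(energy = runif(n), danceability = runif(n))
labels <- factor(ifelse(features$energy > 0.5,
                        "happy/cheerful", "sad/depressed"))

# 90/10 train/test split
train_idx <- sample(n, size = 0.9 * n)

# knn() classifies each test row by the majority label of its k = 3
# nearest training rows
valence_3NN <- knn(train = features[train_idx, ],
                   test  = features[-train_idx, ],
                   cl    = labels[train_idx], k = 3)
mean(valence_3NN == labels[-train_idx])   # overall accuracy
```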

Evaluating Model

Confusion Matrix and Accuracy

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  10120 
## 
##  
##                           | valence_3NN 
## valence_test$valence_fact | happy/cheerful |        neutral |  sad/depressed |      Row Total | 
## --------------------------|----------------|----------------|----------------|----------------|
##            happy/cheerful |           2971 |           1100 |            283 |           4354 | 
##                           |          0.682 |          0.253 |          0.065 |          0.430 | 
##                           |          0.641 |          0.328 |          0.133 |                | 
##                           |          0.294 |          0.109 |          0.028 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##                   neutral |           1323 |           1471 |            752 |           3546 | 
##                           |          0.373 |          0.415 |          0.212 |          0.350 | 
##                           |          0.285 |          0.439 |          0.353 |                | 
##                           |          0.131 |          0.145 |          0.074 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##             sad/depressed |            343 |            781 |           1096 |           2220 | 
##                           |          0.155 |          0.352 |          0.494 |          0.219 | 
##                           |          0.074 |          0.233 |          0.514 |                | 
##                           |          0.034 |          0.077 |          0.108 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##              Column Total |           4637 |           3352 |           2131 |          10120 | 
##                           |          0.458 |          0.331 |          0.211 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
## 
## 

Our 3-NN model resulted in an overall accuracy rate of 54.72%, which is only modestly better than always guessing the majority class (happy/cheerful, 43.0% of the test set). Let’s see if we can determine a k value that improves the overall accuracy of our predictions:

Choosing Optimal K

Looking at the above dataframe, the overall accuracy of the model gradually increases with larger k, topping out around 59.6% for a 21-NN model. However, the gains in accuracy begin to flatten out around six nearest neighbors, so that is the value we will pick for our optimized model.

Elbow Plot:

The elbow plot above confirms that the marginal increase in accuracy largely stops around k = 6, so we will now build a model with 6 nearest neighbors.
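The sweep behind the elbow plot can be sketched as follows; synthetic data again stands in for the real features and labels from the 90/10 split:

```r
library(class)

set.seed(1)
# Synthetic stand-in for the scaled features and valence labels
n <- 300
features <- data.frame(x1 = runif(n), x2 = runif(n))
labels <- factor(ifelse(features$x1 + features$x2 > 1,
                        "happy/cheerful", "sad/depressed"))
train_idx <- sample(n, size = 0.9 * n)

# Test-set accuracy for each candidate k; the "elbow" is where gains flatten
ks <- 1:21
acc <- sapply(ks, function(k) {
  pred <- knn(features[train_idx, ], features[-train_idx, ],
              cl = labels[train_idx], k = k)
  mean(pred == labels[-train_idx])
})
plot(ks, acc, type = "b", xlab = "k", ylab = "overall accuracy")
```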

6NN

Evaluating Model

Confusion Matrix and Accuracy

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  10120 
## 
##  
##                           | valence_6NN 
## valence_test$valence_fact | happy/cheerful |        neutral |  sad/depressed |      Row Total | 
## --------------------------|----------------|----------------|----------------|----------------|
##            happy/cheerful |           3086 |           1080 |            188 |           4354 | 
##                           |          0.709 |          0.248 |          0.043 |          0.430 | 
##                           |          0.661 |          0.309 |          0.096 |                | 
##                           |          0.305 |          0.107 |          0.019 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##                   neutral |           1286 |           1593 |            667 |           3546 | 
##                           |          0.363 |          0.449 |          0.188 |          0.350 | 
##                           |          0.275 |          0.456 |          0.341 |                | 
##                           |          0.127 |          0.157 |          0.066 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##             sad/depressed |            300 |            818 |           1102 |           2220 | 
##                           |          0.135 |          0.368 |          0.496 |          0.219 | 
##                           |          0.064 |          0.234 |          0.563 |                | 
##                           |          0.030 |          0.081 |          0.109 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
##              Column Total |           4672 |           3491 |           1957 |          10120 | 
##                           |          0.462 |          0.345 |          0.193 |                | 
## --------------------------|----------------|----------------|----------------|----------------|
## 
## 

The overall accuracy of our model increased slightly to 57.12% here, indicating that a 6-NN model would probably be preferred over our 3-NN one. The 6-NN model was also marginally better at predicting the correct outcome in each category. Further tuning might be needed to push the accuracy higher still.

Conclusions

A kNN model may not be the best method for determining song valence. Compared to the first two random forests (which had overall accuracy rates in the low 90s), this model was only modestly better than always guessing the majority class. The algorithm correctly predicted the valence of a song only 57.12% of the time, which is definitely not good enough for use in an analytics setting.

Future Work

In the future, the random forest models could be applied to individual user data to determine a listener's music taste and recommend songs based on their likes and dislikes. Our explicit-content random forest could become part of a parental-controls filter that prevents children from discovering and listening to explicit music on certain platforms. In addition, artists, music engineers, and record labels could use our popularity xgboost model to identify the ingredients of hit songs and thereby maximize profit and listener count. However, one must keep in mind that the popularity of music is inherently subjective and there is no set formula for creating a hit song, especially with rapidly changing trends spurred on by social media apps such as TikTok. Nonetheless, the popularity models could be re-run periodically to track those changing trends and act on them.